[WIP] Allowed search results for Django code terms which contain stop words. #1942

Draft
wants to merge 1 commit into
base: main
from

Conversation

sarahboyce
Contributor

@sarahboyce sarahboyce commented Feb 11, 2025

Refs #1097

@sarahboyce sarahboyce changed the title from "Allowed search results for Django code terms which contain stop words." to "[WIP] Allowed search results for Django code terms which contain stop words." Feb 11, 2025
@pauloxnet
Member

@sarahboyce I'm sorry, but I don't have time to review many proposals these days.

However, in the issues you linked I have already tried to explain the problem with search terms that are also stop words.

I'll try to add some considerations that may help.

PostgreSQL full-text search is heavily based on the dictionaries of the language in which the search is performed, so switching to the “simple” dictionary makes you lose many of the features, not only the removal of stop words. See the PostgreSQL documentation.

Dictionaries allow fine-grained control over how tokens are normalized. With appropriate dictionaries, you can:

  • Define stop words that should not be indexed.
  • Map synonyms to a single word using Ispell.
  • Map phrases to a single word using a thesaurus.
  • Map different variations of a word to a canonical form using an Ispell dictionary.
  • Map different variations of a word to a canonical form using Snowball stemmer rules.

https://www.postgresql.org/docs/current/textsearch-intro.html

BTW, the issue you're trying to solve in this PR only applies to searches in English, where an English word is the same as a term used in the code. But English is only one of the languages supported by the documentation, and using the “simple” dictionary could improve things for English but worsen them for other languages.

In fact, if you search for the word “through” in French you will get correct results, and we want to maintain this behavior for all languages other than English.
https://docs.djangoproject.com/fr/5.1/search/?q=through

I'm not sure using the “simple” dictionary for all languages is the best solution.

Another solution I can think of to solve the problem with the English language and leave things unchanged for the other languages of the documentation is to create a custom dictionary for English. This involves removing from the stop words those that match names used in the code.

See https://www.postgresql.org/docs/current/textsearch-dictionaries.html

I posted an example of creating a custom dictionary for French here; the procedure should be similar.

https://stackoverflow.com/a/47248109/755343
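As a rough sketch of what the custom English dictionary could look like, the snippet below only builds the PostgreSQL statements as strings so they can be reviewed before running them. The names `english_code_terms` (a reduced stop-word file, assumed to be installed as `english_code_terms.stop` in PostgreSQL's `tsearch_data` directory), `english_code_stem`, and `english_code` are hypothetical and not taken from this PR.

```python
# Hypothetical names, chosen only for illustration:
# - STOPWORDS_FILE refers to english_code_terms.stop in $SHAREDIR/tsearch_data
# - DICTIONARY and CONFIGURATION are made-up identifiers
STOPWORDS_FILE = "english_code_terms"
DICTIONARY = "english_code_stem"
CONFIGURATION = "english_code"

# Token types that the built-in "english" configuration maps to english_stem.
STEMMED_TOKEN_TYPES = (
    "asciiword", "asciihword", "hword_asciipart",
    "word", "hword", "hword_part",
)


def build_dictionary_sql():
    """Return the SQL statements that set up the custom English configuration."""
    return [
        # Snowball stemmer for English, reading the reduced stop-word file.
        f"CREATE TEXT SEARCH DICTIONARY {DICTIONARY} ("
        f"TEMPLATE = snowball, Language = english, StopWords = {STOPWORDS_FILE});",
        # Start from a copy of the stock English configuration.
        f"CREATE TEXT SEARCH CONFIGURATION {CONFIGURATION} (COPY = pg_catalog.english);",
        # Point the word-like token types at the new dictionary.
        f"ALTER TEXT SEARCH CONFIGURATION {CONFIGURATION} "
        f"ALTER MAPPING FOR {', '.join(STEMMED_TOKEN_TYPES)} "
        f"WITH {DICTIONARY};",
    ]


if __name__ == "__main__":
    for statement in build_dictionary_sql():
        print(statement)
```

Because the new configuration is a copy of `pg_catalog.english`, stemming stays in place and only the stop-word list changes; the other languages keep their own configurations untouched.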

@@ -0,0 +1,38 @@
# Instructions to create a new search dictionary
Contributor Author


I have never done this before, so these instructions may not be very good.
I would love it if we could create this custom search dictionary in our Docker setup as well.

Member


Great job. I still have little time to review this properly, but I had two ideas:

  1. create a migration to create a custom English dictionary
  2. write down the list of words you removed from the original list of stop words, plus a command/patch/other to remove those words from the final stop words list you uploaded
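The second idea could be sketched roughly as below: keep the removed words in one place and derive the final stop-word file from the stock one. The terms in `CODE_TERMS` are examples only (the actual list would be whatever this PR removed), and the function names and file paths are hypothetical.

```python
# Assumption: example code terms only; the real list is the set of words
# this PR removed from the stock English stop words.
CODE_TERMS = frozenset({"through", "for", "if", "in", "with"})


def remove_code_terms(stop_words, code_terms=CODE_TERMS):
    """Return the stop words in their original order, minus the code terms."""
    blocked = {term.strip().lower() for term in code_terms}
    return [word for word in stop_words if word.strip().lower() not in blocked]


def rewrite_stop_file(source_path, target_path, code_terms=CODE_TERMS):
    """Read one word per line from source_path and write the filtered list."""
    with open(source_path, encoding="utf-8") as source:
        words = [line.strip() for line in source if line.strip()]
    kept = remove_code_terms(words, code_terms)
    with open(target_path, "w", encoding="utf-8") as target:
        target.write("\n".join(kept) + "\n")
    return kept
```

Keeping the removed words as an explicit list would make the uploaded stop-word file reproducible from the original, instead of being a hand-edited copy.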

Labels
ops Operations search
Projects
None yet

2 participants